SPAM: Sample data file with walkthrough of the algorithm.

Note: Before running this tutorial, make sure that the directory containing
the Spam executable is listed in your operating system's PATH environment
variable; the sample directory does not include a separate copy of the
executable.


# Input file #######################################################
The sample input file testin.txt contains three customers.
The first customer has 4 transactions, the second has 3, and the third
has 2. Every transaction consists of the same two items, item 1 and
item 2.
 
To create an input data file like this one, use any plain-text (ASCII)
editor and write one line per item, in the following format:
<CustID><space><TransID><space><ItemID>\n
Make sure that there is no extra blank line at the end of the input file.
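
As an alternative to typing the file by hand, the following Python sketch
writes a data set with the same shape as the sample (customers with 4, 3,
and 2 transactions, each containing items 1 and 2). It is not part of the
Spam distribution. The function names and the output file name
testin_generated.txt are made up for this example, and it assumes that
customer and transaction IDs are numbered from 1. It reads the format
above literally: every record ends with a single newline and nothing is
written after the last one.

```python
def build_lines(transactions_per_customer, items):
    """Return one '<CustID> <TransID> <ItemID>' record per item occurrence."""
    lines = []
    # Assumption: customer and transaction IDs start at 1.
    for cust_id, n_trans in enumerate(transactions_per_customer, start=1):
        for trans_id in range(1, n_trans + 1):
            for item_id in items:
                lines.append(f"{cust_id} {trans_id} {item_id}")
    return lines


def render(lines):
    """Terminate every record with '\\n' and add no blank line at the end."""
    return "".join(line + "\n" for line in lines)


def write_input(path, lines):
    # newline="\n" keeps the line endings identical on every platform.
    with open(path, "w", newline="\n") as f:
        f.write(render(lines))


if __name__ == "__main__":
    # Hypothetical file name, chosen so the shipped testin.txt is not overwritten.
    write_input("testin_generated.txt", build_lines([4, 3, 2], [1, 2]))
```

The generated file has 18 records, from "1 1 1" to "3 2 2".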


# Running the algorithm ###########################################
On the command line, run Spam with the following options: 
spam -fn testin.txt -sup 0.6 -outFile testout.txt -ascii


# Interpreting the output ######################################### 
Spam generates two files in the same directory: summary.txt and testout.txt.

summary.txt should look like the following:
--------------
Number of customer: 3
Minimum support: 2 ( 0.6 )
Mining Duration:0.01
Program Duration: 0.12

Number of Compression: 0
Number of Ors: 0
Number of Ands: 110
Number of Count: 114
Number of CountZeros: 0
Number of CountSmaller: 0
Number of CreateSBitmap: 39
Number of CreateCBitmaps: 0
--------------

You will probably be most interested in the four top-most values.
The minimum support of 0.6 is represented here as a 2, indicating that
2 out of the 3 customers must have a sequence in order for the sequence
to be considered frequent. The Mining Duration is the actual running time
of the algorithm, while the Program Duration is the algorithm running time
plus the file I/O running time.
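
The relation between the 0.6 threshold and the customer count of 2 in the
summary can be checked with a few lines of Python. This is only a sketch:
it assumes that the fractional threshold is rounded up to a whole number
of customers, which matches the sample summary but is not confirmed here
as Spam's exact rule. The function name is made up for this example.

```python
import math
from fractions import Fraction


def min_support_count(threshold, num_customers):
    """Smallest number of customers that satisfies a fractional threshold.

    `threshold` is passed as a string such as "0.6" so that it can be
    converted exactly, without binary floating-point rounding.
    Assumption: the product is rounded up to the next whole customer.
    """
    return math.ceil(Fraction(threshold) * num_customers)


print(min_support_count("0.6", 3))  # 0.6 * 3 = 1.8, rounded up to 2
```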


testout.txt should look like the following:
-------------
1 - 3
1 -1 1 - 3
1 -1 1 -1 1 - 2
1 -1 1 -1 1 2 - 2
1 -1 1 -1 2 - 2
1 -1 1 2 - 3
1 -1 1 2 -1 1 - 2
1 -1 1 2 -1 1 2 - 2
1 -1 1 2 -1 2 - 2
1 -1 2 - 3
1 -1 2 -1 1 - 2
1 -1 2 -1 1 2 - 2
1 -1 2 -1 2 - 2
1 2 - 3
1 2 -1 1 - 3
1 2 -1 1 -1 1 - 2
1 2 -1 1 -1 1 2 - 2
1 2 -1 1 -1 2 - 2
1 2 -1 1 2 - 3
1 2 -1 1 2 -1 1 - 2
1 2 -1 1 2 -1 1 2 - 2
1 2 -1 1 2 -1 2 - 2
1 2 -1 2 - 3
1 2 -1 2 -1 1 - 2
1 2 -1 2 -1 1 2 - 2
1 2 -1 2 -1 2 - 2
2 - 3
2 -1 1 - 3
2 -1 1 -1 1 - 2
2 -1 1 -1 1 2 - 2
2 -1 1 -1 2 - 2
2 -1 1 2 - 3
2 -1 1 2 -1 1 - 2
2 -1 1 2 -1 1 2 - 2
2 -1 1 2 -1 2 - 2
2 -1 2 - 3
2 -1 2 -1 1 - 2
2 -1 2 -1 1 2 - 2
2 -1 2 -1 2 - 2
-------------

Each line describes a frequent sequence and the number of customers who had
that sequence. Each -1 marks the end of one itemset and the start of the
next, and the number after the final "-" is the number of customers that
had the sequence. Take for example the line
1 2 -1 1 -1 1 2 - 2
This corresponds to the sequence (1,2),(1),(1,2) being found for 2 out of
the 3 customers, which gives a support of 2/3 (about 0.67), above the 0.6
threshold.
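
The line format can also be read programmatically. The Python sketch below
parses one line of testout.txt into its itemsets and customer count, then
recounts the support by brute force against the sample data as described
in the input file section (customers with 4, 3, and 2 transactions, each
containing items 1 and 2). The function names are made up for this
example, and the brute-force check is not how Spam itself computes
support.

```python
def parse_line(line):
    """Split 'items -1 items ... - count' into (list of itemsets, count)."""
    tokens = line.split()
    if len(tokens) < 3 or tokens[-2] != "-":
        raise ValueError(f"unexpected line format: {line!r}")
    count = int(tokens[-1])
    itemsets, current = [], []
    for tok in tokens[:-2]:
        if tok == "-1":          # -1 closes the current itemset
            itemsets.append(current)
            current = []
        else:
            current.append(int(tok))
    itemsets.append(current)     # the last itemset is closed by the final "-"
    return itemsets, count


def contains(customer, sequence):
    """True if each itemset of `sequence` occurs, in order, in a later
    transaction of `customer` than the previous one."""
    pos = 0
    for itemset in sequence:
        # Greedily take the earliest transaction that holds the whole itemset.
        while pos < len(customer) and not set(itemset) <= customer[pos]:
            pos += 1
        if pos == len(customer):
            return False
        pos += 1
    return True


# Sample data rebuilt from the description of testin.txt.
customers = [[{1, 2}] * n for n in (4, 3, 2)]

sequence, count = parse_line("1 2 -1 1 -1 1 2 - 2")
recount = sum(contains(c, sequence) for c in customers)
print(sequence, count, recount)  # [[1, 2], [1], [1, 2]] 2 2
```

Only the first two customers have the three transactions this sequence
needs, so the recount agrees with the 2 reported by Spam.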

#######################################
Questions?
Email Jay Ayres at kja9@cornell.edu, or Manuel Calimlim at calimlim@cs.cornell.edu.